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Abstract 



One of the main concepts in quantum physics is a density matrix, which is a symmetric 
positive definite matrix of trace one. Finite probability distributions can be seen as a 
special case when the density matrix is restricted to be diagonal. 

We develop a probability calculus based on the sc more genera il distributions that in- 
cludes definitions of joints, conditionals and formulas that relate these, including analogs 
of the Theorem of Total Probability and various Bayes rules for the calculation of posterior 
density matrices. The resulting calculus parallels the familiar "conventional" probability 
calculus and always retains the latter as a special case when all matrices are diagonal. We 
motivate both the conventional and the generalized Bayes rule with a minimum relative 
entropy principle, where the Kullbach-Leibler version gives the conventional Bayes rule and 
Umegaki's quantum relative entropy the new Bayes rule for density matrices. 

Whereas the conventional Bayesian methods maintain uncertainty about which model 
has the highest data likelihood, the generalization maintains uncertainty about which unit 
direction has the largest variance. Surprisingly the bounds also generalize: as in the con- 
ventional setting we upper bound the negative log likelihood of the data by the negative 
log likelihood of the MAP estimator. 

Keywords: generalized probability, probability calculus, density matrix, quantum Bayes 
rule. 

1. Introduction 

The main notion of a "mixture state" used in quantum physics is a density matrix. States 
are unit vectors u ( || || 2 = 1)- For the sake of simplicity we assume in this paper that 
the underlying vector space is R n (for finite n). Each state u (unit column vector in W 1 ) 
is associated with a dyad uu T G W nxn . The dyad uu T may be seen as a one-dimensional 
projection matrix which projects any vector onto direction u. These dyads are the ele- 
mentary events of a generalized probability space. It is useful to keep the corresponding 
"conventional" probability space in mind, which consists of a finite set of size n. The n 
points are the elementary events and a probability distribution may be seen as a mixture 
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over the n points, i.e. such a probability distribution is specified by n real numbers that are 
bigger than zero and add to one. In the generalized case there are infinitely many dyads 
even if the dimension n is finite. 1 

Density matrices generalize finite probability distributions. They can be defined as 
mixtures of dyads W = ^2% UiWiwJ where the mixture coefficients ioi are non-negative and 
sum to one. There may be an arbitrary number of components in the mixture. However, any 
n dimensional density matrix can be decomposed into a mixture of n orthogonal eigendyads, 
one for each eigenvector (see Figure 1.1). Mixtures of dyads are always symmetric 2 and 
positive definite. A density matrix W can be depicted as an ellipse which is an affine 
transformation of the unit ball: {Wu : 1 1 tt 1 1 2 = 1} (See Figure 1.2). A dyad is a degenerate 
ellipse with a single axis in direction ±it that has radius one (Figure 1.1). Note that dyads 
have trace one: 

tr(mi T ) = tr(u T u) = = 1. 

Therefore, density matrices also have trace one. 

A density matrix W assigns generalized probability ti(Wuu T ) to each unit vector u 
and its associated dyad uu T (see Figure 1.2). This probability is independent of how W is 
expressed as a mixture and can be rewritten as u T Wu. Note that if the symmetric positive 
definite matrix A is viewed as a covariance matrix of a random cost vector c, then u T Au 
is the variance of the cost along direction u, i.e. the variance of c • u. 

If a = (ai, . . . , a n ) is a probability vector, then the n-dimensional matrix diag(a) with 
vector a as its diagonal is a density matrix. Note that diag(a) = J2i a i e i e Ji where the 
e$ are the standard basis vectors. Thus conventional probability distributions are special 
density matrices where the eigensystem is restricted to be the identity matrix. In this 
paper we develop a Bayesian style analysis for the case when the eigensystem is allowed to 
be arbitrary. 

Perhaps the simplest case to see that something unusual is going on is the uniform 
density matrix, i.e. ^ times identity I. This density matrix assigns probability ^ to every 
unit vector, even though there are infinitely many of them. However, note that the sum of 
generalized probabilities of any set of n orthogonal dyads is = 1. As a matter of fact 

1. The machinery for infinite dimensional vector spaces is available. However, in this paper we start with 
the simplest finite dimensional setting. 

2. In quantum physics complex numbers are used instead of reals. In that case "symmetric" is replaced by 
"hermitian" and all our formulas hold for that case as well. 
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Figure 1.2: Figure (a) depicts a red ellipse (Wii : ||u||2 = 1} for some density matrix W. The green curve 
shows part of the unit ball. The blue figure-eight is a plot of the generalized probabilities in direction u, i.e. 
tr(Wuu T )u. Figure (b) plots a 3-dimensional density matrix (red ellipsoid) and its associated generalized 
probability surface (in blue). 

for any density matrix W and any set of n orthogonal directions ttj, the total generalized 
probability is one (see Figure 1.3) 

tr(W u iU J) = tr(W ^ u.uj) = tr(W) = 1. (1.1) 



i=l 




This means that while in the conventional case probabilities are additive over the points in 
the set, in the generalized case probabilities are additive over orthogonal sets of dyads. 

In this paper we use density matrices as generalized priors and develop a unifying 
Bayesian probability calculus for density matrices with rules for translating between joints 
and conditionals. All formulas retain the conventional case as the special case when the 
matrices are diagonal. In previous work (War05) we derived a generalized Bayes rule based 
on the minimum relative entropy principle, but no satisfactory probabilistic interpretation 
was given for this rule. This Bayes rule fits nicely into our new calculus and we can interpret 
it using the notion of generalized probability introduced above. 

For any fixed orthonormal system Ui, one can use the dyads u^uj as elementary events 
of a conventional probability space. As already discussed, any density matrix can be seen as 
assigning conventional probabilities to these events that sum to one. Thus if the orthonor- 
mal system is fixed, generalized probability space is reduced to conventional probability 
space over the vectors in the chosen system. Our approach is fundamentally different in 
that we use density matrices to maintain uncertainty over all orthonormal systems. Our 
conditional density matrices are part of the probabilistic system specified by a generalized 
joint probability distribution. In particular, our conditioning method leads to generaliza- 
tions of the theorem of total probability that involve density matrices. 

In (TRW05) various on-line learning updates were generalized from vector parameters 
to matrix parameters. Following (KW97), the updates were derived by minimizing the loss 
on the current instance plus a divergence to the last parameter. In this paper we use the 
same method for deriving a Bayes rule for density matrices, which becomes the foundation 
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Figure 1.3: For a set of orthogonal directions Ui and a density matrix W , the sum of generalized probabilities 
tr(WuiuJ) over the set is one. Figure (a) shows this for 2-dimensional case: the red ellipse is a density 
matrix W, the blue figure-eight is a plot of the generalized probablity tr(Wuu T ) around the circle, and 
for any two orthogonal vectors iti and U2, tr(WuiuJ) + tr(Wu2uJ) = 1. Figure (b) shows the three- 
dimensional case: for any three orthogonal directions iti, U2 and 113, the probabilities a, b and c of the three 
associated dyads sum to one. 

of our generalized probability calculus. When the parameters are probability vectors over 
the set of models, then the "conventional" Bayes rule can be derived using the relative 
entropy as the divergence (e.g. (Zel98; KW99; SWRL03)). Analogously, we now use the 
quantum relative entropy, introduced by Umegaki, to derive the generalized Bayes rule. 

The new rule uses matrix logarithms and exponentials to avoid the fact that symmetric 
positive definite matrices are not closed under the matrix product. The rule is strikingly 
similar to the conventional Bayes rule and retains the latter as a special case when the 
matrices are diagonal. Various cancellations occur when the conventional Bayes rule is 
applied iteratively and as we shall see, similar cancellations happen with the new rule (See 
Section 9.2). The conventional Bayes rule may be seen as a soft maximum calculation and 
the new rule as a soft calculation of the eigenvector with the largest eigenvalue (see figures 9.1 
and 9.2). In figures 9.3 and 9.4 we plot the projections of posterior onto the eigendirections 
of the fixed datalikelihood matrix D(y|M). The projection onto the eigendirection of the 
largest eigenvalue is a sigmoid like function. 

The mathematics applied in this paper are most commonly used in quantum physics. 
For example, the assignment of generalized probabilities ti(Wuu T ), can be seen as the 
outcome of a quantum measurement of a system in mixture state W being acted upon 
by a measurement apparatus described by the dyad uu T . It is tempting to call the new 
rule the "quantum Bayes rule". However, we currently do not have a quantum physical 
interpretation of this rule. In particular, the state collapse following a measurement does 
not explicitly appear in our calculus, also our Bayes rule can not be described as a unitary 
evolution of the prior state. The term "quantum Bayes rule" also has been claimed before in 
(SBC01), where they derive a rule that describes uncertainty information about unobserved 
quantum measurements of a composite system as a density matrix. 

Our work is most closely related to a paper by Cerf and Adami (CA99), where, in the 
context of quantum information theory, a formula was proposed for the conditional density 
matrix that uses the matrix exponential and matrix logarithm. This special formula appears 
in our calculus and is now put in a more general context. We hope to transfer many 
techniques developed in Bayesian Statistics based on the conventional Bayes rule to the 
context of generalized probabilities. 
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The paper is organized as follows. Section 2 recalls the relevant matrix algebra facts. 
Section 3 introduces density matrices and generalized probability distributions and states 
Gleason's theorem that establishes an equivalence between them. Then, in Section 4 we 
introduce a generalization of the matrix product that is commutative and preserves 
positive defmiteness. This operation is central to our calculus. Section 5 introduces 
generalized joint distributions. Section 6 discusses marginalizing the joints. Next, in Section 
7 we give formulas for conditional density matrices. Section 8 presents generalizations of 
the Theorem of Total Probability. In Section 9 we present the founding piece of this work, 
the Bayes rule for density matrices, its derivation and various properties. We also discuss 
how the new Bayes rule for density matrices is in some sense the conventional Bayes rule in 
an optimally chosen eigensystem. Section 10 summarizes all the rules in our calculus and 
their justifications. In the conclusion section we discuss again how our new calculus relates 
to quantum physics and possible generalizations of it. 

2. Facts on Matrices and Basic Notation 

In this paper generalized probability distributions, conditionals and data likelihoods are 
represented as symmetric positive definite matrices. We will now discuss some relevant 
matrix algebra facts. 

The basic fact that we use a lot is the eigendecomposition of symmetric matrices: 

n 

S = ScrS T = <7j SisJ 

i=l 

This says that every such matrix can be written as a product of an orthogonal matrix of 
eigenvectors S times a diagonal matrix of eigenvalues 7 times <S T . Alternatively it can be 
written as mixture of eigendyads formed from the eigenvectors where the eigenvalues act as 
mixture coefficients. 

Any symmetric positive definite 3 matrix C can be seen as a covariance matrix of some 
random cost vector c £ W 1 , i.e. C = E ((c — E(c)(c — E(c)) ). A covariance matrix C 
can be depicted as an ellipse {Cu : \\u\\2 = 1} centered at the origin, where the eigenvectors 
form the principal axes and the eigenvalues are the radii of the axes (see Figure 1.2). 

Note that a covariance matrix C is diagonal if the components of the cost vector are 
independent. The variance of the cost vector c along a vector u, that is the variance of the 
dot product c T u, has the form 

Y(c T u) =E ((c T u - E(c T u)) 2 ^ 

=E (((c T - E{c T ))u) T {{c T - E{c T ))u)) 
=E (u T (c - E(c))(c - E(c)) T )uj 
=u T Cu. 

The variance along an eigenvector of the covariance matrix is the corresponding eigenvalue. 
Using this interpretation, the matrix C may be seen as a mapping from the unit ball to 

3. We use the convention that positive definite matrices have non-negative eigenvalues and strictly positive 
definite matrices have positive eigenvalues. 
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M>o, i.e. unit vector u is mapped to u Cu. Figure 1.2 depicts the resulting figure-8-like 
plots in 2 and 3 dimensions. A second interpretation of the scalar u T Cu is the square 
length of u w.r.t. the basis y/C, that is u T Cu = u 1 \fC\fCu = || v / C'it|||. 

The trace tr(£?) of an arbitrary square matrix E is the sum of its diagonal elements En. 
It is a linear operator. Recall that tv(EF) = ti(FE) for any matrices E € M. nxm , F € 
W mxn . Also, for symmetric square matrices, tr(ST) = Y^ij SijTij, thus trace can be seen 
as a dot product between matrices. The trace has a useful cycling property: for arbitrary 
matrices E,F,G with compatible dimensions tr(EFG) = ti(FGE) = tr(GEF). From 
this follows that trace is rotation invariant in the sense that for any orthogonal matrix 14, 
ti(lAElA T ) = tr(U. T lAE) = ti(E). If S is symmetric, setting U to be the eigensystem of S 
results in the observation that trace is equal to the sum of eigenvalues of a matrix. Also, 
for any orthogonal system 4 tij, 

n n 

tr(S) = tr( J] u iU J S) = uj S Ui . 

i=i i=i 

V v ' 

7 

Therefore if S is symmetric positive definite, then tr(S) is the total variance along any set 
of orthogonal directions. Recall that density matrices have trace one and therefore in this 
case this total variance is always one (See Figure 1.3). 

The matrix exponential exp(S) of the symmetric matrix S = ^ i <jj SisJ is computed 
by exponentiating the eigenvalues and leaving the eigenvectors unchanged: exp(S^) = 

exp(cj) SisJ . The matrix logarithm log(A) is defined similarly but now A must be 
strictly positive definite. Clearly, the two functions are inverses of each other. It is im- 
portant to remember that exp (S + T) = exp(5) exp(T) only holds if S and T commute 
i.e. ST = TS. 5 However, the following trace inequality, known as the Golden- Thompson 
inequality 6 (Bha97), always holds: 

tr(exp(5) exp(T)) > tr(exp (S + T)) for symmetric S and T, (2.1) 

where equality holds iff both symmetric matrices commute. 

3. Generalized Probability Distributions and Density Matrices 

In quantum physics a dyad uu T represents a pure state and density matrices are mixture 
states. As we shall see density matrices can be interpreted as generalized probability dis- 
tributions over the set of dyads. Note that in this paper we want to address the statistics 
community and use linear algebra notation instead of Dirac notation. Any probability vec- 
tor (P(Mi)) can be represented as a diagonal matrix diag(P(Mj)) = J2i P{Mi) e^e 1 ", where 
Si denotes the ith standard basis vector. This means that conventional probability vec- 
tors are special density matrices where the eigenvectors are fixed to be the standard basis 
vectors. 

4. A set of unit vectors m is orthogonal iff £\ UiuJ — I. 

5. This occurs iff the two symmetric matrices have the same eigensystem. 

6. Note that the Golden-Thompson inequality does not generalize to three matrices, i.e. there exist sym- 
metric S, T, U, s.t. tr(exp(S) exp(T) exp(C7)) ^ tr(exp (5 + T + U)). 
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For the sake of simplicity we assume that our vector space is W 1 . However, everything 
discussed in this section holds for separable finite or infinite dimensional real and complex 
Hilbert spaces. 

A function /j,(u) from unit vectors u in W 1 to R is called a generalized probability dis- 
tributions if the following two conditions hold: 

• Vtt, < n(u) < 1. 

• If «i, . . . , u n form an orthonormal system for R n , then ^ n{ui) = 1. 

Gleason's Theorem states that there is a one-to-one correspondence between generalized 
probability distributions and density matrices 7 in W ixn : 

Theorem 1 (Gle57) Let n > 3. 8 Then any generalized probability distribution fi on W 1 
has the form fi(u) = ti(Wuu T ), for a uniquely defined density matrix W. 

It is easy to see that every density matrix defines a generalized probability distribution. 
The other direction, is highly non-trivial. 9 As discussed in the introduction, the dyads 
uu T function as elementary events. One may ask what corresponds to arbitrary events 
and how probabilities are defined for them. In the conventional case, an event is a subset 
of the domain which can be represented as a vector in {0, l} n . In the generalized setting, 
an event is a symmetric positive definite matrix P with eigenvalues in {0, 1}. Each such 
matrix P with eigendecomposition Yli=i PiPi 1S a projection matrix for a subspace of R n 
and its probability w.r.t. a distribution W is defined as the sum of the probabilities of the 
elementary events PiPj comprising P: 

k 

tr(WP) = Y J ^(Wp tP J). 

i=i 

Interpreting P as a covariance matrix of some random variable, we can also expand W 
and sum the variance along its eigendirections Wi weighted by the eigenvalues Ui which are 
probabilities: 

. ..... variance 

n n probability ^ 

tr(WP) = tr^Wj WiivJP) = ^ "wT wj Pwi . (3.1) 

1 i S v ' 

1=1 1=1 . , 

expected variance 

Random variables are defined in an analogous way. In the conventional case a random 
variable associates a real value with each point. Now a random variable is an arbitrary 
symmetric matrix S. Such matrices have arbitrary real numbers as their eigenvalues and 
trace tr(WS) when S is expanded becomes the expectation of the random variable w.r.t. 
density W: 

probability 
outcome ^ 

tr(WS) =tr(V^cw7) = Y, ^ sjWsi. (3.2) 



expected outcome 



The core of the original proof of Gleason's Theorem was for R 3 (Gle57), and he then generalized the 
proof to separable real and complex Hilbert spaces of dimension n > 3. 

A slightly different version of this theorem that is based on "effects" instead of dyads holds for dimension 
2 as well (CFMR04). 

However, if dyads are replaced by "effects" then the proofs are much simpler (CFMR04). 
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As discussed before, the conventional case of the expectation calculation is always retained 
as a special case when all the matrices are diagonal (i.e. fixed eigensystem I). In quantum 
physics the expectation calculation tr(WS) has the following interpretation: an instrument 
is represented by a hermitian matrix S and tr(WS) is the expected value of a quantum 
measurement of the mixed state W with instrument S. The eigenvalues oi of the instrument 
represent the possible numerical measurement outcomes. Each one of those outcomes is 
observed with probability sj Wsi , where Sj is the associated eigenvector of the instrument 
matrix S. 

In real quantum systems the measurement causes the mixtures state W to collapse into 
one of the orthogonal states {s\sj , . . . , s n , s^}: the successor state is SisJ with probability 
sJWsf 

probability 

measurement ^ Ss, ^ 

w — y, s ^ Ws * s ^ ■ 

collapse 1 

V v ' 

expected state 

As we shall see, the expected measurement calculations play an important part in our 
calculus. However our update rules for density matrices (such as our Bayes rule) do not 
explicitly include a collapse in the above sense. 

Note that some of the equations above hold for arbitrary decompositions into a linear 
combination of dyads of any size. For example (3.1), holds for any decomposition W = 

uJi WiwJ , i.e. the uii may be negative, the Wi may be non-orthogonal, and the size of the 
decomposition may be larger than n. If the tUi are non-negative, then they form a probability 
vector. Similarly, (3.2) also holds for any decomposition S = J2i a i s i s J ■ However, quantum 
measurements are always based on an orthogonal system. Furthermore, orthogonal systems 
are special in that the orthogonal decomposition of a density matrix W = ^ • uji w^wj 
attains the minimum of the entropy Y2i ~ UJ i m over all possible decompositions of W 
(Inequality (11.86) in (NCOO)). 

A question that naturally arises is whether we can model the generalized probability 
distributions defined above with a conventional probability space. In other words, is there 
a conventional probability space and two mappings: one that maps density matrices to con- 
ventional probability distributions and the other mapping dyads to events of this probability 
space. The requirement on these two mappings is that the conventional probability calcu- 
lations using the images of density matrices and dyads under these mappings satisfy the 
definition of the generalized probability distributions given above. Essentially, it is known 
that conventional probability spaces cannot satisfactorily model generalized probabilities, 
but the details are rather involved. This topic has received considerable attention in the 
quantum physics community and we refer readers to (HolOl) for an extended discussion of 
impossibility results. Here we only give one simple attempt to model density matrices with 
a conventional probability space and show that the two natural mappings fail to satisfy the 
requirements. 

A natural interpretation of a density matrix is to view it as a parameterized density over 
the unit sphere. We claim that if n(u) is the uniform density on the sphere, then for any 
symmetric positive definite matrix A £ M nxn of trace n, u J Au jx{u) is also a conventional 
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probability density on the sphere: 

A 



J u T ^^aiaiaj u fi(u)du = ^^ai J (u J otj) 2 fi(u)du 

i i 

1 

tr(A) I (w T (^, . . • , -^) T ) 2 fi(u)du = f (u) 2 fi(u)du . 



In the second equality we used the fact that ix{u)du is uniform and therefore the integral 
of (u T a,i) 2 is the same as the integral of the squared dot product of u with uniform vector 

vys' • • • ' y/7i> ■ 

We modeled density matrices as conventional probability densities over the sphere. Now 
the natural mapping from dyads to events in the conventional probability space (the sphere) 
maps uu T to {u, —u}. However the probability of the latter sets of size 2 is zero with 
respect to the conventional probabilities densities we defined on the sphere. In particular 
the probability on any n orthogonal dyads does not sum to one. 



4. Commutative Matrix Product Operation 

It is well known that the product of two symmetric positive definite matrices might be 
neither symmetric nor positive definite (see Figure 4.1). In this section we define a commu- 
tative "product" operation between symmetric positive definite matrices that does result 
in a symmetric positive definite matrix. Our first definition of this operation requires the 
two matrices to be strictly positive definite. We then extend the definition to arbitrary 
symmetric positive definite matrices and prove many properties of this product. 

For two symmetric and strictly positive definite matrices A and B, we first define the 
as: 

sym.pos.def. 

, A 

sym. sym. 

/ v , ~ V 

sym.pos.def. sym.pos.def. 

AQ B := exp(log A +log B ), (4.1) 

where here the exponential and logarithm are matrix functions. The matrix log of both 
matrices produces symmetric matrices which are closed under addition and the matrix 
exponential of the sum returns a symmetric positive definite matrix. See Figure 4.1 for a 
comparison of matrix product and 0. 

Note that we expressed the operation between symmetric strictly positive definite 
matrices as a + operation between symmetric matrices. Similarly, for any two arbitrary 
symmetric matrices S and T, 

S + T = log(exp(5) exp(T)). 

The operation was used in (Ale02) to define a "product" between two linear transfor- 
mations that is commutative. In this paper we use to define conditional density matrices 
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Figure 4.1: The matrix product of two positive definite 
matrices does not preserve positive definiteness. For 
two matrices A and B we plot their ellipses Au,Bu 
and figure eights tv(Auu T ) u, tr(Buu T ) u (for unit 
u). Both ellipses are very thin, i.e. the ratio between 
the two eigenvalues of each matrix is 100. We also plot 
the ellipse ABu and the curve tr(ABuu T ) u. The 
latter curve consists of two figure eights, the larger 
one constitutes the part where the trace is positive 
and the smaller and skinnier one is the part where 
the trace is negative. This means that AB is not 
positive definite any more. The product is also not 
symmetric because the min/max value of %r{ABuu Y ) 
does not correspond to the axes of the ellipse. Finally, 
the corresponding plots for A Q B indicate that this 
matrix is symmetric and positive definite. 



Figure 4.2: When the ellipses A and B don't 
have the same span, then A B lies in the in- 
tersection of both spans. In the depicted case the 
intersection is a degenerate ellipse of dimension 
one (blue line). This generalizes the following 
intersection property of the matrix product when 
A and B are both diagonal (here of dimen- 
sion four): (AB) ii i 7^ iff A iti / and Bi,; / 0. 



diag(A) 


diag(B) 


diag(AB) 











a 











b 





a 


b 


ab 



and generalizations of the Bayes rule. A similar path was followed by (CA99) for defin- 
ing conditional density matrices of composite systems. We also give a motivation for the 
operation based on the minimum relative entropy principle (as was done in the conference 
paper (War05)) and our probability calculus includes the formula of (CA99) for composite 
systems as a special case. 

Note that the formula for in Equation (4.1) is not defined if some of the eigenvalues 
of A or B are zero. We now rewrite the operation using the Lie- Trotter formula and then 
extend it to arbitrary positive definite matrices. The Lie- Trotter formula (see e.g. (Bha97)) 
is the following equation: 



exp(E + F) = lim (exp(_E/n) exp(F /n)) n , any square matrices E, F. 
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A B AB BA (A" 2 B" 2 f (B m A m f AQB 



Figure 4.3: The behavior of the limit formula for operation. We can see that the additional figure eights 
indicating negative definiteness are smaller for (A 1 ^ 2 B 1 ^ 2 ) 2 than for AB. As n increases, the additional 
figure eights shrink further and lim n ^ ao (A 1 ^ n B 1 ^")" = AQB becomes positive definite. Also, AB and BA 
are fairly different from one another. The matrices (A 1/2 B 1/2 ) 2 and (B 1/2 A 1/2 ) 2 are already more similar 
and the difference between the two multiplication orders decreases with n until in the limit AQB = B&A. 

By choosing E = log A and F = log B, for symmetric and strictly positive definite A and 
B, we obtain: 

exp(log A + log B) = lim (A 1 ' n B 1 l n ) n . 

As n increases, (A l l n B l l n ) n gets closer and closer to being positive definite and symmetric. 
The first couple iterations of the limit formula are plotted in Figure 4.3. See (Ale02) for 
additional plots. Notice that the limit is defined even when A and B have zero eigenvalues. 
We therefore extend the definition of to arbitrary symmetric positive definite matrices A 
and B : 

AQB := lim {A 1 / n B 1 / n ) n . (4.2) 

n— »oo 

From now on we use the above exended definition of 0. Numerous properties of this 
operation are given below. 

Theorem 2 For any symmetric positive definite matrices A, B, C the following holds: 
0P1. Intersection property: 

range(A B) = range(A) n range(-B). 

where the range of a matrix is the linear subspace spanned by the columns of the 
matrix. This property generalizes the intersection properties for products of diago- 
nal matrices (which model conventional probability distributions): the product of two 
diagonal matrices with the characteristic vectors of two subsets as diagonals gives a 
diagonal matrix formed from the characteristic vector of the intersection (See Figure 
4.2). 

0P2. Let Ra be a matrix whose columns form an orthonormal basis for the range of A, 
i.e. Ra £ M. nxk and R a Ra = Ik> where k is the dimensionality of the range of 
A. Define Rb analogously. In a similar fashion Rac\B w M contain the basis for the 
intersection of ranges. Let log + denote the modified matrix logarithm that takes the 
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log of non-zero eigenvalues but leaves the zero eigenvalues unchanged. This operation 
can be also defined by the following formula: 1 ® 

log+ A = R A log(R\AR A ) R\. (4.3) 

With this notation, can be written as 

A B = R AnB exp(i^ nB (log+A + log+ B)R AnB ) R AnB . (4.4) 

OP 3. A B = AB if A and B commute. 

0P4- is commutative, i.e. A B = B A. 

0P5. The identity matrix is the neutral element, i.e. A J = A. 

0P6. (cA) B = c(A B), for any scalar c > 0. 

OP7. A0 A -1 = I for invertible A. Also, A0 A + = P A , where A + denotes the pseudoin- 
verse and P A is the projection matrix 11 for range(A). 

OP8. is associative, i.e. (A B) C = A (B C). 

OP 9. Monotonic convergence of the limit defining 0: 

Vn > 1 : tr(A 1/(n+1) B 1/(n+1) ) n+1 < tr(A 1/n B 1/n ) n 

OP10. tr(A B) < tr(AB), where equality holds iff A and B commute. In particular, for 
any unit u tr(A uu T ) = ti(Auu T ) iff u is an eigenvector of A. 

OP 11. For any unit direction u G range(A), A uu T = e uT ( log+ A ^ u uu T . 

OP12. For any unit direction u and eigendecomposition J2i cti a i a l of a strictly positive def- 
inite matrix A, 

ti{Auu T ) = ^2(u T ai) 2 ai, andtr(AQuu T ) = ]J a\ uTat) \ 

i i 

i.e. the matrix product corresponds to an arithmetic average and the product to a 
geometric average of the eigenvalues of A. 

OP13. det(A B) = det(A) det(S), which is the same as for the normal matrix product. 

OP14- For any orthogonal system ui, we have J^ti^A UiuJ) = det(A). 

10. Note that when the rank k of A is zero, then one still can define the projections in a consistent manner. 
In this case Ra is of dimension n x 0, and the matrices R\Ra and log (R\ER a) are of dimension 
x for any E £ K nxn . Also it is natural to define RaER\ as the n x n zero matrix 0. With this 
definition, the r.h.s. of (4.3) is when A is 0. 

11. Note that P A = R A R T A . 
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OP 15. For any unit direction u, tr((A B) uu ) = tr(A uu ) tr(£? uu ). 

0P16. For any unit direction u G rangeA, tr(A + 0ti« T ) = j^^^T) , where A + denotes 
the pseudoinverse. 

Proof Properties 0P1 and 0P2 follow from results in (Kat78) or Theorem 1.2 of (Sim79). 
Here we only prove that range(A B) C range(A) n range(B). We can split the limit 
defining as follows: 

AqB = lim (A 1 / n ,B 1 / n ) n = lim A l / n lim B 1 / n (A 1 / n B 1 / n ) n ~ 1 . (4.5) 

n^oo n-^oo n— >oo 

Here we used the property that lim E n F n = lim E n lim F n if all the limits exist. This follows 
from the corresponding sum and product properties of scalar limits and the fact that entries 
of a product matrix are finite sums of products. 

It is easy to see that lim^oo A 1 /" = Pa because the matrix power for a symmetric 
matrix corresponds to taking powers of the eigenvalues and n-th roots converge to either 
zero or one. Thus the limit is a matrix whose eigenvalues are or 1, which is a projection 
matrix. By plugging 

lim A 1 /™ = P A = P a Pa = Pa lim A l ' n 

into (4.5) we get 

AQ B = Pa lim A x l n lim B l l n {A l l n B l ' n ) n - 1 = P A {A B). 

n~^co n-^oo 

This implies that range(A0i?) C range(A). Similarly we can prove AQ B = (AQ B)Pb, 
which implies that range(A B) C range(S), and therefore range(A 0B)C range(A) n 
range(-B). 

Property OP3 can be seen from the definition of via the limit formula (4.2): when 
A and B commute, then the n copies of A l / n in (A l l n B 1 / n ) n can be gathered into A and 
similarly for B x l n . 

Properties 4-7 easily follow from the formula (4.4) for 0. 

Property OP8. For strictly positive definite A,B,C associativity reduces to the asso- 
ciativity of addition in the log domain. To show it in general we use the representation 
(4.4) of via the log + operation. Let R = R(AnB)nc = -^An(BnC)- Then: 

(AQB)QC = R exp(i? T (log+(A Q B) + log+ C)R) R T 

Now we rewrite log + (A B) using Equation (4.3): 

log+(A B) = Rahb log(R\ nB (A Q B)Rat\b) Rahb 

Substituting expression (4.4) for AQB into the above and using R\ nB RAc\B = Ik we get: 

log+(A B) = Rat\bRat\B (l°g + A + log+ B) RahbRahb ■ (4 6) 

V v ' V v ' 

Pahb Pac\B 
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Here PaciB = RAnBRAnB 1S t ne projection matrix onto the subspace range(A)nrange(.B). 
All the basis vectors of A n B n C obviously lie in the larger subspace as well, thus the 
projection leaves them unchanged and we get PahbR = R, R Pahb = R J ■ Thus: 

(AQB)QC = Rexp(R T (\og + A + log+ B + log+ C)R)R T . 

The same expression can be obtained for A {B C), thus establishing associativity. 

Property 0P9. By Fact 8.10.9 of (Ber05), we have that for any positive definite matrices 
A and B, and n > 1: 

tr{A n B n ) n+1 < tr(A n+1 B n+1 ) n . 

Now by substituting A = A 1 ^™^ 1 )), B = ^VOO+i)) the monotonicity property OP9 
immediately follows. 

Property OP10. When A and B are strictly positive definite, this inequality is an 
instantiation of the Golden-Thompson inequality (2.1). For arbitrary positive definite ma- 
trices, the property follows from the previous monotonicity property OP9. Note that there 
are symmetric positive definite matrices A, B and C s.t. tr(A B C) ^ tr( ABC). 

Property OP11. We use the expression for operation given in Equation (4.4). Since 
u G range(A), the basis of the intersection space is u itself: 

uu T A = wexp(it T (log + uu T + log + A)u)u T . 

Note that log + uu T = and that the expression inside the exponential is a scalar. The 
desired property immediately follows by moving this scalar to the front. 

Property OP 12. The expression for the trace of the matrix product is the expected 
measurement interpretation (3.2) discussed in Section 3. Note that (u 1 ai) 2 is a probability 
vector and in this expression uu 1 can be replaced by any density matrix. 

For the second trace tr(A0UM T ), we can rewrite it using OP 11 and eigendecomposition 
of A as follows: 

tr(A uu T ) = e uT ^ Au = e ^K-) 2 ^ = J\a { f ai) \ 

i 

which is a weighted geometric average of ati with weights (u 1 ai) 2 . 

Property OP13. Since det(-E-F) = det(-E) det(F), and for symmetric matrices S and 
r g M, det(5 r ) = det(S') r , we have det((A 1 / n J B 1 / n ) n ) = det(A) det(B) for all n G N. By 
Property (4.2), the limit of the l.h.s. of the last equality becomes A B and this proves 
the property. 

Property OP14. If A is not full rank, then det(A) is zero. In that case, there will be 
some Ui that is not in the range of A. For that Wj, tr(A uiuj) = 0, making the whole 
product zero. When A has full rank, we rewrite the product as follows: 

Htr(AQ Ul uJ) Y[ e Wo S A UiU J ) 

i i 

= e tr(logA£^0 = e tr(logA) = Q . = det ( A ) 
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Property 0P15. If u ^ range(A) n range(-B) °= 1 range(A B), then the property 
trivially holds because tr((A0B)0ii« T ) and either tr(A0«« T ) or tr(_B 0wu T ) are zero. 
When u G range(A) n range(B), then the property essentially follows from e a+b = e a e b : 

tr((A©S) Quu T ) 

°= 1 e uT l °S + (AQB)u ( 4 3 e u T P A n B (lo S + A+log+ B)P AnB u = g-u T (log+ A+log+ B)u 
= e vJ log+ Au ^ log+ Bu = tr(A Q ^T) tr(B Q uu Vy 



Needless to say Property OP15 does not hold if uvJ is replaced by a mixture of dyads. 
Property 0P16. Trivially follows from OP11. 



We will now discuss some of the properties further. In particular, we will show a simple 
example that demonstrates that the upper bound OP 10 can be quite loose when both 
matrices are dyads. In this case the inequality becomes: 

tr(tm vv ) < tr(im vv ) = (u ■ v) 2 . 

The right hand side can be made arbitrarily close to one by choosing almost parallel u and 
v. The left side is zero in this case, which can be seen by analyzing the intersection of the 
ranges. Dyads are rank one matrices and their ranges are lines through the origin. The 
intersection of two such lines is either only the origin or the line itself. Thus, by Property 
(OP1) it follows that uu T vv T = 0, unless u = ±v. This can also be seen from the limit 
expression in Equation (4.2): 

uu T vv T = lim ((uu T )™ (vv T )«) n = lim (uu T vv T ) n 

n^oo n— +oo 

= ( lim (u ■ i>) 2n-1 ) uv T = 0, unless u = ±v. 

n— >oo ' 

Where the last equality holds because \u-v\ < 1, when u / ±v. 

Note that the expression (4.4) for based on log + gives us a convenient method for 
computing the operation even when the matrices have some zero eigenvalues. The modified 
matrix logarithm log + is easily computed via Equation (4.3). The matrix Ra containing the 
orthonormal basis for range of A can be computed using Gram-Schmidt orthogonalization 
procedure or the QR-decomposition. To compute the basis for the intersection of range(A) 
and range(-B), we express the intersection i.t.o. the union and the orthogonal complement 
- 1 of a space: 



range(A) n range(S) = Irange(A) U range(S) 




For any matrix E, an orthonormal basis for range(-E) can be obtained by completing an 
orthonormal basis for range(-E) to an orthonormal basis for the whole space. The additional 
basis vectors needed are the basis for range(£ : )- L . Also, if we have two matrices E and F, 
we can get the range for the union of their ranges just by putting all columns of E and 
F together into a bigger matrix G = (E,F). Clearly, range(G) = range(£) U iange(F). 
Piecing all of this together gives an implementation of the operation. 
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5. Joint Distributions 

A density matrix defines a generalized probability distribution over the dyads from one 
space. However we need to consider several spaces and joint distributions over them. In 
the conventional case A, B denote finite sets {01, . . . , a„ A }, {61, . . . , b nB }, (P(aj)), (P(bj)) 
probability vectors over these sets and (P(aj, bj)) is an ua x ns dimensional matrix of prob- 
abilities for the tuple set Ax B. In the generalized case, A,B denote real finite dimensional 
vector spaces of dimension n&,n.R and D(A), D(M) are the density matrices defining the 
generalized probability distributions over these spaces. The joint space (A,B) is the tensor 
product 12 between the spaces A and B, which is of dimension n^n^. The joint distribution 
is specified by a density matrix over this joint space, denoted by Z?(A,B). 

We let D(a), D(b) denote the probabilities assigned to dyads aa ,bb T from the spaces 
A, B by the density matrices D(A), D(M>), respectively: 

D(a) :=tr(£>(A)aa T ), D(b) := tr(£>(B)b6 T ). (MJ1) 

The conventional probability distributions can be seen as diagonal density matrices. A 
probability distribution (P(aj)) on the set A is the density matrix diag((P(aj))). Also 
P(a J -) = cTdiag((P(a i )))c j . 

To introduce the joint probability D(a, b) we need the Kronecker matrix product. Given 
two matrices E and F with dimensions n x m and p x q, their Kronecker product (also 
known as direct product or tensor product) E F is a matrix with dimensions np x mq 
which in block form is given as: 

/en-F e 12 F ... e lm F\ 
e 2 \F e 22 P ••• e 2m F 



\e n iF e n2 F . . . F) 
The Kronecker product has the following useful properties: 
KP1. {E® F) T = E T ®F T . 

KP2. (E (g> F)(G (g) H) = EG ® FH if the dimensions are appropriate. 
KP3. tr(E®F) =tr( J E)tr(P). 

KP4. If symmetric matrix S has eigenvalues G{ and eigenvectors Sj and symmetric matrix T 
has eigenvalues Tj and eigenvectors tj, then S®T has eigenvalues U{Tj and eigenvectors 

Si (g) tj. 

KP5. For symmetric positive definite matrices A, B, C, D, (A (g> B) (C D) = (A 
C) (BOD). 

12. See (Bha97) for a formal definition of tensor product between vector spaces. For us, the tensor product 

of 1™* and K n » is R"*" 1 . 
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The first four properties are standard. The last property follows from the limit definition 
(4.2) of the operation. 

(A<g>B)o(C<g>D)= lim ((A <g> B)*(C <8> £>)»)" 

n^oo ' 

= lim ((A-C«)0(B»D»))" 

n— >oo ' 

= ( lim (A^B^) n ) ( lim (C»D») n ) 

The last transition which moved the limit inside the Kronecker product, follows from the fact 
that the elements of the Kronecker product matrix are just pairwise products of elements 
from the two matrices. And when all limits exist, a limit of a product of two number 
sequences is a product of limits. 

Now the joint probability D(a, b) becomes the probability assigned by density matrix 
Z?(A,B) to the jointly specified dyad (a ® 6) (a <8> b) T : 

D(a,b) :=tr(D(A,B)(a®6)(a®b) T ) = tr(£>(A, B)(aa T ® bb T )). (MJ3) 

Note that in the conventional case a joint between two sets A and B is defined over all pairs 
of points from A and B. However, in the generalized case, there are elementary events in the 
joint space that don't decompose into elementary events of the marginal density matrices, 
i.e. there are dyads in the joint space that are not of the form (a®b)(a®b) T . This is what 
quantum physicists call "entanglement" . 



6. Marginalization of the Joint via Partial Traces 

We would like to be able to perform marginalization operations on our joint density matrix 
D(A,B), i.e. obtain the density matrix D(A) from the joint matrix. In the conventional 
case the marginalization was performed by summing out one of the variables by summing 
the rows or the columns of the matrix specifying the joint probability distribution. For 
density matrices, the analogous operation is the partial trace (see e.g. (NCOO)). 

The partial trace is a generalization of normal matrix trace. It typically produces a 
matrix instead of a number and can be used to retrieve the (scaled) factor matrices from a 
Kronecker product. We denote the partial trace with tr^, where A specifies the space to be 
"summed out" . Suppose G is a matrix over the space A <g> B and A has dimension n and B 
dimension m. Thus G has dimension nm x nm and can be written in block form as a n x n 
matrix ofmxm matrices Gij : 

I G\\ G\2 ■ ■ ■ G\ n \ 

G21 G22 ■ ■ ■ G2n 



\G n \ G n 2 ■ ■ ■ G nn J 

Here we suppose that space A is R n and space B is R m . Then the two partial traces of this 
matrix are given by: 

tTA(£) = Gn + G22 + ■ ■ ■ + G nn 

mxm 
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r B (G) 

nxn 



(ix(Gu) tr(Gi 2 ) ... tr(Gm)\ 
tr(G 2 i) tr(G 22 ) ••• tr(G 2 „) 



\tr(G„i) tr(G n2 ) ... tr(G nn )/ 

In multilinear algebra partial traces are known as tensor contractions and can of course be 
generalized to the tensor product of more than two spaces. The partial trace is a linear 
operator and we now give some other useful properties: 

PT1. tr A (F (g> F) = tr(F)F, tr B (F <g> F) = tr(F)F. 

PT2. tr(G) =tr(tr A (G)) =tr(tr B (G)). 

PT3. tr A (G(I A F)) = tr A (G)F, tr A ((J A F)G) = Ftr A (G). 
PT4. tr(G(F ® F)) = tr(tr B (G(J A ® F))E). 

The first three properties are straightforward and the last one follows from the others as 
follows: 

tr(G(F ® F)) K = 2 ti(G(I A ® F){E ® /»)) = tr((E ® J B )G(I A ® F)) 

P = 2 tr(tr B ((F ® J«)G(J A ® F))) P = 3 tr(Ftr B (G(I A F)). 
We use the partial trace to define marginals as follows: 

D(A) :=trs(D(A,B)), D(B) := tr A (D(A, B)) (MJ2) 

The following lemma shows that D(A) and D(B) defined this way are again density matri- 
ces. 

Lemma 3 Partial trace of a density matrix is also a density matrix. 

Proof Symmetry is obvious. Trace one follows from Property PT2 of the partial trace: 

tr(D(A)) = tr(tr B (-D(A,B))) = tr(£>(A,B)) = 1. 
Positive definiteness follows by a similar argument: 

a T D(A)a = tr(£>(A)aa T ) P = tr(tr B (D(A, B)(aa T ® / B ))) 
P = tr(D(A, B) (aa T <8>5^t>i&7)) 

i 

= ^tr( J D(A,B)(aa T ^6A T )) 

i 

= ^(a0foi) T O(A,B)(a® 6j) > 0. 

i 

■ 

Partial traces also allow us to define objects of the type D(A, b). In the conventional case 
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this corresponds to taking one row or column out of the joint probability table. In the 
generalized case we want the following property to be satisfied: 

tr(D(A, b)aa T ) = D{a, b). (MJ5) 

This is accomplished by defining D(A, b) via the following formula: 

D(A,b) := tr B (-D(A,B)(/ A ® bb T )). (MJ4) 

Property MJ5 now follows from partial trace Property PT4. We can also see that trace of 
D(A, b) gives us the probability D(b): 

tr(D(A,6)) M = 4 tr(tr M (D(A,M)(I A ®bb T ))) P = 2 tr(tr A (Z>(A,B)(I A ® 66 T ))) (6.1) 
P I 3 tr(tr A (£)(A,B))b& T ) ^ 2 tr{D(M)bb T ) = D{b). 

A brief note on matrix properties of D(A,b). We just saw that its trace is D(b) which 
is between zero and one. Since it satisfies Property (MJ5), it is positive definite as well. 
Symmetry is also easily verified. 

Note that for any orthogonal system bi of B, 



D(A) = ^D(A,fti). 



This can be seen as follows. 



D(A) M = trfc(D(A,B)) = tn,(D(A,B)(lA®I B )) 



ti*(D(A,B)(J A ® J^bibJ)) = ^tr B (-D(A,B)(I A ® M = ^£>(A,&. 



The conventional definition of independence also naturally generalizes: Z)(A) is independent 
of D(B) if the joint density matrix decomposes: D(A,~B) = D(A) (g) D(B). It is easy to see 
that in this case we have D(a, b) = D(a)D(b) for all a, b: 

D(a, b) = tr((D(A) ® D(M))(aa T bb T )) K = 2 tr((£>(A)aa T ) (D(M)bb T )) 
K = 3 tr{D(A)aa T )tr(D(M)bb T ) = D{a)D{b). 



7. Conditional Probabilities 

The topic of conditional probabilities in this generalized setting contains many subtleties. 
First we will give the defining formulas for conditional density matrices and then discuss 
some of the issues. 

CP1. D(A|B) := £>(A,B) (Ia -D(B)) -1 (Formula (4) of (CA99) expressed with the 
operation). This formula requires D(M) to be invertible. In the conventional case, this 
corresponds to the conditional probabilities being undefined if the event conditioned 
on has probability zero. 
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CP2. D(A\b) := 

CP3. D(a|B) := D(a,M) D(B) _1 . 

CP4. Z)(a|b) := • This basic conditional probability is a straightforward generaliza- 

tion of the conventional case. It also has a quantum-mechanical interpretation. See 
Appendix A for details. 

Note that CP1 has the form: density matrix inverse of a normalization. We can also 
reexpress the other definitions in this unified form: 



CP'2. D(A\b) = tr B (r>(A,B)(J A ® bb T )) Qtr M {{I A D(M))(I A bb r ))~ 1 . 

^ ' ^ ' 





N «, 

D(A,b) 


' S v 

1 j 




CP'3. D(a\M) 


= tr A (£>(A,B) (aa T 


J B )) 0tr A ((J A £>(B))(aa T 


0/b)) -1 - 




D(a,M) 


' V v 




CP'4. D(o|6) 


= tr(D(A,B)(aa T <g 


)6b T ))0tr((/ A J D(B))(aa T $ 


Siftfe 1 "))- 1 . 




" v 

D(a,b) 


' V v 

0{b) 





We say that the joint density Z?(A,B) is decoupled if its eigendecomposition has the 
form: D(A,B) = (W A ® Wb) diag(u>)(W A ® W B ) T - Note that W A 0Wi is orthogonal iff 
both Wa and Wb are orthogonal. As we shall see later, dealing with conditionals is often 
simpler in the decoupled case. We first prove an upper bound for tr(D(A|B)) that is tight 
iff the joint is decoupled. 

Lemma 4 The following inequality holds: 

tr(D(A|B)) < n B , 

where n B is the dimensionality of space B. Furthermore, tr(D(A|B)) = n B if and only if 
the joint D(A,B) is decoupled. 

Proof The inequality is shown using properties of and partial traces: 

tr(D(A|B)) = tr(£>(A,B)0(J A 0Z)(B)- 1 )) < tr(.D(A,B)(J A D(M) )) 
P = tr(tr A (D(A,B)(J A .0(B)- 1 ))) P = tr(tr A (Z>(A,B)) 12(B)- 1 ) 

V v ' 

D(B) 

= tr(I B ) = «B- 

Remember that equality in Property OP10 of only occurs when the two matrices com- 
mute. Two matrices commute iff their eigensystems are the same. This gives us the condi- 
tion that the eigensystem of _D(A,B) must be the same as the eigensystem of J A 0_D(B) -1 . 
The latter eigensystem is clearly decoupled. Thus for equality to hold it is necessary that 
the eigensystem of D(A,M) be decoupled. 
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Now we will argue that it is also sufficient. Let the joint density matrix have eigensystem 
D(A,M) = (Wa ® Wb) diag(cj)(Wj <8> Wj). Ja commutes with any matrix on space A. 
Therefore it suffices to show that the marginal D(M) in this case has eigensystem Wb- The 
decoupled eigensystem matrix Wa <8> Wb has the following list of tia^e colums: 

Wa^Wb = (wl®w^,w{®wl,...,w 1 A ®w^ a , 



wl ® wL wl <g> wl, . . . , to? ® w« B , 



wl A <8> tok, w n k A ®wl..., wl A ® m n » 



)■ 



In correspondence with this structure we adopt a double indexing scheme for the eigenvalues 
Uij of the joint matrix D(A,M), where ujij is the eigenvalue associated with eigenvector 
w l A <S> lOg. The index i runs from 1 to %, and j runs to n®. Now the eigendecomposition 
can be written as: 

D(A,B) = ««) T ® <K) T ). 

i,3 

Partial trace is a linear operator and tiA{w k (w k ) T ® to^(w^) T ) P = 1 w^(iu^) T . Therefore: 

D(B) = tr A (U(A,B)) = ^3 <(<) T , 

j 

where LOj = Y^i^iJ- Thus we produced the eigendecomposition of the marginal D(E) and 
it indeed has eigensystem Wb- ■ 

Let us briefly discuss the connection and difference between our notion of decoupled 
joints and the notion of entanglement that appears in quantum physics. Recall that entan- 
glement, as we mentioned at the end of Section 5 corresponds to the fact that there are 
dyads cc T in the joint space (A,B) that can't be written as aa T ® bb T for any two dyads 
aa T and bb T in A and B. This notion carries over to mixed states or density matrices. In 
quantum physics, a joint density matrix D(A,B) is called separable (or non-entangled) if it 
can be expressed as (Wa ® Wb) diag(cj)(WA <8> Wb) T - The crucial difference between the 
definitions of separable and decoupled matrices is that in the separable case, Wa an d Wb 
don't have to be orthogonal. Every decoupled matrix is separable, but there are separable 
density matrices that are not decoupled. The question of deciding whether a given matrix 
is separable is known to be very difficult, whereas the question of being decoupled is easily 
decided by e.g. the condition of the above lemma. One of the reasons for which (CA99) 
introduced a conditional density matrix via Rule CP1 was to give a necessary condition for 
the separability of a joint density matrix. 

To complete the rules for conditional density matrices, we would need rules that allow 
us to marginalize the conditionals, e.g. for going from _D(A|B) to D(a\b). One obvious 
consequence of our definitions is the marginalization rule for D(A\b): 

D(a\b) C l A D ^ M i 5 tr(j3( ^ b | aflT) C l 2 tr( J D(A|6)aa T ). (MC4) 
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There don't seem to be any other simple marginalization rules for _D(A|B) and D(a\M) 
that hold for arbitrary joints. However, when the joint is decoupled, then the following 
additional marginalization rule for D(A|B) is valid: 

Lemma 5 For all decoupled joints D(A,M), 

D(a\M) = tr A (D(A|B) (aa T ® J B )). 

Proof We will compute both sides of the equation and show them to be identical. We 
begin by writing down the decomposition of the decoupled joint from Lemma 4: 

D(A,B) = X>J («4«) T ® <(<) T )- (7-1) 

i,3 

Additionally, in the same lemma, the following form for D(B) was established in this case: 

D(B) = tr A (£>(A,B)) = ^(<) T , (7.2) 

3 

where Uj = Yli^ij- According to CP3, D(a|B) = D(a, B) D(B) _1 , therefore we will 
need to compute D(a,B): 

D(o,B) M i 4 tr A (-D(A,B)(aa T /«)) ( = } tr A ((^(^i) T ^K) T )(aa T /«)) 

K = 2 ^2^i,j ^A((w l A (w l A ) T )aa T «^(<u^) T ) P = J2uij(w A • a) 2 u^(tu£) T . 
Together with (7.2), this gives: 



D(a\M)= ^ AWl ' a)2 <(<) 



h3 



Now, we proceed to the right side of the equation in the lemma. Substituting (7.1) and 
(7.2) into the formula for D(A|B) we obtain: 



D(A|B) C ^ D(A, B) (J A DiMr 1 ) = £ ^ (™X«) T ® <K) T )- (7.3) 



Using linearity of the partial trace we compute the right side as follows: 

tr A (£>(A|B)(aa T /„)) ( = 3) V ^ ti A ((w A (w A ) T ^K) T )(aa T /»)) 

" ' UJj 

T-,3 

K I*Y1 — tr A ((^(^) T )aa T <K) T ) 



UJj 
1,3 
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As discussed, obtaining D(a\b) = tr (- p (^'^)(°^^ bb )) f r0 m D(A\M) is non-trivial. In par- 
ticular, there are cases where 

tr(D(A|B) (aa T bb T )) + D(a\b), 

even when Z?(A,B) is decoupled and a and b are not eigenvectors of D(A) and D(M), 
respectively. Curiously enough, if we replace the matrix product with 0, then we always 
have 

tr(£)(A|B) (aa T ®bb T )) 

C = tr((D(A,B) (I A DiB)- 1 )) (aa T bb T )) 

°= 15 tr(D(A, B) (aa T bb T )) tr((J A .0(B)- 1 ) (aa T bb T )) 

OPW tr(£>(A,B) (aa T 0bb T )) 
tr(D(B) bb T ) 

Let us now recall the conditionals in the conventional probability theory. The full 
conditional table P(A\B) lists conditional probabilities of all pairs of elementary events 
P(di\bj). This table has the obvious properties: The sum of all entries is tib and the sum 

of any column is 1, i.e. YliP( a i\bj) = Yli P< p{b'-) ^ = ¥(b~) = Thus a conditional table is 
a column-stochastic matrix and for any such matrix we can construct a joint that has that 
matrix as its conditional table. For example we can take arbitrary probability vector p and 
multiply the i-th column of P(A\B) by pi, now the sum of each column is pi and thus the 
sum of all entries is 1 and we have a valid joint. Note that this implies that many different 
joints have the same conditional table. 

The decoupled case behaves as the conventional case, i.e. many joints correspond to the 
same conditional. A decoupled joint and conditional always have the same eigensystem and 
going from the joint to the conditional is similar to the conventional case (See (7.1-7.3) for 
details). 

However, for non-decoupled joint density matrices, i.e. when tr(Z)(A|B)) < tjb (Lemma 
4), the situation is quite different. For example, the eigenvalues of _D(A|B) can now be 
bigger than 1 (CA99). Also based on numerical experiments, we conjecture that in the 
non-decoupled case, the mapping between £)(A,B) and D(A|B) is invertible, i.e. unlike the 
conventional case, there is only one joint that gives rise to a given conditional matrix. In 
other words we conjecture that in the non-decoupled case it suffices to specify the conditional 
D( 



More specifically, we claim that the following EM-like algorithm converges to D(B) and 
c PI 

then D(A,M) = D(A|B) D(M): W is initialized to J B /n B and the estimate W t +i for 
D(M) is computed from Z?(A|B) and the previous estimate Wt as 

tr A (D(A|B) (J A W t )) 



W t +i 



tr(D(A|B) (J A W t )) 



23 



Warmuth and Kuzmin 



8. Theorems of Total Probability 

The Theorem of Total Probability is an important calculation in conventional probability 
theory. It expresses probability of some event a as an expected conditional probability of 
the elementary events hi that form a partition of the probability space B: 

P(a) = Y^P(a\b i )P(b i ). 

i 

TP1. For any orthogonal system h t of B, D(a) = X) i -D(o|ft i )D(6 i ). 

TP2. D(a) = tr(D(o|B) D(B)) 

TP3. D(A) = tr»(U(A|B) (J A D(M))). 

The first formula can be shown as follows: 

D(a) =tr(U(o,B)) = ^ bj D(a, 1)6, = J^D(o,6i) = ^ D(a|6 i ) J D(6 i ). 

i i i 

To derive the second apply 0D(B) to both sides of CP3, take trace of both sides and use 
(6.1). The proof of the third property follows the same outline but uses CP1 and MJ2. 

Conventional versions of the last two properties are obtained when the density and con- 
ditional matrices are diagonal. Note that in general these generalizations of the Theorem of 
Total Probability do not "decouple" , i.e. you cannot write them as a sum of products of con- 
ditional and marginal probabilities. However, using the Property OP10 of operation we 
can establish upper bounds on probability of D{a) in terms of "decoupled" sums that look 
like the conventional versions of the Theorem of Total Probability. If D(M) = Y^i w ? w i w J 
and D(a\M) = ^ Aj UiuJ are eigendecompositions of the corresponding matrices, then 

D(a) = tr(D(o|B) 0D(B)) < tr(D(o|B)D 

. . variance 
probability ^ 



'wT wjD(a\M)wi 

i 

y v ' 

expected variance 

pr obability D(u j) outcome 

ujD(M) Ui • (8.1) 



expected measurement 



The first version of the upper bound corresponds to using the eigendecomposition of D(M) 
and can be interpreted as an expected variance calculation with D(a\E) as the covariance 
matrix. The second version expands D(a\E) and corresponds to a quantum measurement 
of system in state Z?(B) with instrument specified by D(a\M). Letting p(bi) equal uji or 
^^^(B)^, and letting p(aj|6j) equal wj D(a\M)wi or A«, we see the correspondence of these 
upper bounds to the conventional Theorem of Total Probability. The equality only occurs 
when D{a\B) and D{B) commute. 
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9. Bayes Rules 

In the conventional setup we assume that a model Mj is chosen with prior probability 
P{Mi). The model then generates the data y with probability P(y\Mi), i.e. 

p( y ) = J2 p ( M i) p (y\ M i) 

i 

= tr(diag ((P(Mi)) diag ((P(y\Mi))) . 

The reason why we expressed P(y) as a trace of two diagonal matrices will become apparent 
in a moment. 

The generalized setup is completely analogous. There is an underlying joint space (M, Y) 
between the model space M and the data space Y. The prior is specified by a density matrix 
D(M). The data is a unit direction y in Y space that is generated by the density D(Y). 
The probability D(y) can be expressed i.t.o. the prior D(M) and data likelihood D(y\M) 
using TP2: 

D(y) = tr(U(M) D(y\M)). 

Note that in the conventional case we first chose a model based on the prior and then 
generated data based on the chosen model. In the generalized case we do not know how to 
decouple the action on the prior from the choice of the data when conditioned on the prior. 
Let us first recall the conventional Bayes rule and rewrite it in matrix notation: 

pmy) = P(M t )P(y\M t ) where p{y) = Y^ P{M] )P{y\M j ) (9.1) 

d iag (P(M^)) - trN(W)diag(J , p . ))) . 

We now present and discuss the analogous Bayes rule for the generalized setting. At 
the end of this section we present a list of all Bayes rules. 

In the generalized Bayes rule we cannot simply multiply the prior density matrix with 
the data likelihood matrix. This is because a product of two symmetric positive definite 
matrices can be neither symmetric nor positive definite (See Figure 4.1). Instead, we replace 
the matrix multiplication with operation: 

D(M\y) = D( M)® D (y\ M ) ? where D ( y ) = tr (£>( M ) D(y\M)). (9.2) 

Normalizing by the trace ensures that the trace of the posterior density matrix is one. In 
both the conventional as well as the new Bayes rule above, the normalization constant is the 
likelihood of the data. When the matrices D(M) and D(y\M) have the same eigensystem, 
then becomes the matrix multiplication. In the following subsections we derive the above 
Bayes rules from the minimum relative entropy principle. For the conventional Bayes rule 
the standard relative entropy between probability vectors is used, whereas the generalized 
Bayes rule and the crucial operation is motivated by the quantum relative entropy between 
density matrices due to Umegaki (see e.g. (NCOO)). 

We visualize the conventional Bayes rule in Figure 9.1. Repeated application of the 
rule with the same likelihood makes the posteriors increasingly concentrated on the point 
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Figure 9.1: We apply the conventional Bayes rule 
4 times, using the the same data likelihood vec- 
tor P(y\Mi) and making the current posterior 
the new prior. At first, the posteriors are close 
to the initial prior but eventually the posteri- 
ors focus their weight on arg max^ P(y\Mi). The 
conventional Bayes rule may be seen as a soft 
maximum calculation. The initial prior is in red, 
the likelihood is in green and posteriors are in 
blue. 
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Figure 9.3: We plot many iterations of the conven- 
tional Bayes rule when the same data likelihood 
{P(y\Mi)) = (.7, .84, .85, .9) is used in each itera- 
tion and the prior is (P(M t )) = (.29, .4, .3, .01). For 
each of the four models we plot the posterior prob- 
ability as a function of the iteration number. Ini- 
tially the posterior curve with likelihood .85 over- 
takes the curve with likelihood .84, but eventually 
the curve with likelihood .9 takes over both. Note 
that the curve with the largest data likelihood looks 
like a sigmoid and the one with smallest like a re- 
verse sigmoid. 



Figure 9.2: We depict several iterations of the gen- 
eralized Bayes rule. The red ellipse depicts the prior 
D(M), the green ellipse depicts the data likelihood 
matrix D(y\M), which is kept fixed on successive iter- 
ations, and the blue ellipses depict posteriors D(M\y). 
The posterior density matrices gradually move away 
from the prior and focus on the longest axis of the co- 
variance matrix. The generalized Bayes rule can be 
seen as a soft calculation of eigenvector with largest 
eigenvalue. 




20 40 60 80 100 120 



Figure 9.4: We plot many iterations of the gen- 
eralized Bayes rule when the same data like- 
lihood matrix D(y\M) is used in each itera- 
tion. As the prior £)(M) we choose the di- 
agonalized prior diag((P(Mj)) of Figure 9.3 on 
the left and as the likelihood D(y\M) we choose 
U diag((P(j/j Mi))U , where the eigensystem U is 
a random rotation matrix. Let D(M\t) denote the 
posterior at iteration t when the fixed D(y\M) is 
used in all iterations. The curves are the projec- 
tions of this posterior onto the four eigendirections 
of D(y\M) as a function of t, i.e. uj D(M\t)ui, 
where m are the columns of U. The above plot 
is qualitatively similar to the left plot. The curve 
corresponding to the largest eigenvalue of the data 
likelihood is again a partial sigmoid. 
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with maximum data likelihood P{y\Mi). Therefore this rule can be interpreted as a soft 
max-likelihood calculation. Figure 9.2 demonstrates the generalized Bayes rule. There the 
posterior gradually moves towards the eigenvector belonging to the largest eigenvalue of the 
data likelihood matrix D(y\M). Thus the new rule can be interpreted as a soft calculation 
of the eigenvector with maximum eigenvalue. 

In Figure 9.5 we depict a sequence of updates with the new Bayes rule when the data 
likelihood matrix is different in each iteration. Observe that based on the relative lengths of 
the axes (eigenvalues) and the directions of the axes (eigenvectors) in the ellipse describing 
the current data likelihood matrix, the posterior adjusts its axis lenghts and directions. 

Other Bayes rules for our calculus are listed below. They all express one conditional in 
terms of the corresponding reverse conditional. 

BR1. D(B|A) = {I A ®D(M))QD(A\M)Q(D(A) / B ) _1 , where D(A) = tr B ((J A 0.D(B))0 
D(A|B)). 

BR2. D(b\A) = D(b)D(A\b) -D(A)" 1 , where D{A) T = tr B ((J A D(B)) D(A|B)). 

BR3. D(M\a) = D ( M )® D ( a \ M ) ^ where D ( a ) T = tr(D(B) D(o|B)). 
L)\a) 

This is the Bayes rule derived in (War05) that was discussed above. 

BR4. D(b\a) = D ( b W°\ b ) where D f a \ T = 1 D(bi)D(a\bi). 
D[a) ^— ' 

i 

The summation in the normalization factor proceeds over any orthogonal system foj. 

All these Bayes rules can be easily derived as follows: first express the conditional on the left 
i.t.o. the joint by applying the definitions of conditional probability from Section 7; then 
apply these definitions again for expressing the joint in terms of the reverse conditional. 
For ©XcLixiplc 

D(M\a) °= D(B ' a) °- 3 ' D(B) ® jD(a|B) 



D{a) D{a) 

As was mentioned above, the new Bayes rule can be seen as a soft maximum eigenvalue 
calculation. We will now give an example that shows that its impossible to track the 
maximum eigenvalue without changing the eigensystem. First, suppose that we have a 
diagonal density matrix W = Yli u i e « e 7 an d another diagonal matrix S = a% eiej . 
Then tr( WS) = J^i ^i^i and this means that by changing uoi we can easily focus on the high 
Uj. Now suppose W is diagonal as before, but S has the Hadamard matrix eigensystem. 
Hadamard matrices H are square n x n matrices that have ±1 elements and satisfy the 
condition HH T = nl. Thus M= is an orthogonal matrix. Let hj be the columns of this 

orthogonal matrix derived from a Hadamard matrix and let S = J2i a i hihj . Entries of hi 
are therefore tr (eiej hjhj) = ^. Computing the trace we obtain: 

tv(WS) = J2 a i T M e i e J h j h J) = -tr(W)tr(S) = 

i,j 

This means that any diagonal density matrix W only "sees" the average of eigenvalues of 
S and is unable to focus on the highest eigenvalue. 
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Figure 9.5: Sequence of Bayes updates with the new Bayes rule (9.2): from left to right, the prior is in red; 
the first data likelihood matrix is below in green; the first posterior is above in blue, and so forth. 



9.1 Deriving the Conventional and Generalized Bayes Rule 

In this section we show how to derive the conventional Bayes rule (9.1) and the generalized 
Bayes rule for density matrices (9.2) by minimizing a tradeoff between a relative entropy 
and an expected log likelelihood. For two probability vectors x and y, the relative entropy 
is defined as A(x,y) := Sj^logf 1 - We use the convention that OlogO := which is 
justified by lim^^o x log x = 0. It is well known that A(x,y) > and that A(x,y) = iff 
x = y. 



Theorem 6 Let the prior (P(Mj)) be any probability vector and the data likelihood {P{y\Mi)) 
be any non-negative vector of the same dimension. Then 

-logP(y) = inf A(K>, (P(iWi))) - VV log P{y\Mi) : 

(uji) prob.vec. z— 



and u> = (P(Mi)P(y\Mi)/ P(y)) is the unique optimum solution. 



Proof Let the support of a vector x be the set of all indices 1 < i < n s.t. X{ ^ and denote 
this set as s(a;). For any probability vector (cjj), such that s((c<jj)) C s(P(M«)) ns(P(y|Mj)), 
we have 

- log P(y) =J> log ^ - £ UHP(vm - J> log pmp ^ Mi)/p{y) ■ 



Aa^i),(P(Mi))) A((a; i ),(P(M i )P(2/[M i )/P(2/))) 
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The precondition on the support of (u>i) assures that all three sums above are finite because 
it avoids the case WjlogO, when uji > 0. Since the l.h.s. is a constant, 

inf A {(ui) , (P(Mi))) - £ UiP(y\Mi) 

{iOi) prob.vec. i 

s((^)) Cs(P(M l ))ns(P(y|M J )) 

sup -A((a; i ),(P(M i )P(y|M i )/P(y))). 
(ajj) prob.vec. 
s((^)) Cs(P(M l ))ns(P(y|M J )) 

The sup clearly has u = {P{Mi)P{y\Mi) / P{y)) as its unique solution and the inf remains 
unchanged if the condition on the support of (wj) is dropped. This gives us the statement 
of the theorem. ■ 

This theorem can also be proven using differentiation (see e.g. (Zel98; KW97; SWRL03)). 
For the density matrix case this was done in (War05; TRW05). We now prove the corre- 
sponding theorem for density matrices in a different way. For two density matrices A and 
B, the quantum relative entropy is defined as A(A, B) := tr(A(log A — log B)). There is a 
potential problem when some of the eigenvalues of the matrices are zero. However, we will 
now reason that this definition is justified under the assumption OlogO = and A(A, B) 
is bounded iff range(A) C range(-B). 

The first term tr(AlogA) becomes ^•ccjlogaj, where the aj are the eigenvalues of 
A. This term is always finite. If B is eigendecomposed as YliPi ^^7' then the second 
term tr(AlogB) can be rewritten as YlibjAbi log/%. If range(A) C range(B), then 
Tange(B) ± C range(A)- 1 , where _L denotes the orthogonal complement space. If (3i = 0, 
then bi £ range(B) 1 - and under our assumption on range(A) this also means that bjAbi = 
0. Therefore, for all i, s.t. 0i = 0, the summand bj Abi log/3j has the form OlogO = 0. If 
on the other hand, range(A) ^ range(S), this also means range(S)- 1 ^ range(A)- 1 -. The 
eigenvectors bi with zero eigenvalues form a basis for range(S)- 1 and therefore there exists 
some bi s.t. bj Abi 7^ 0. This gives a summand of the form xlogO, with x ^ 0, and this is 
infinite. Notice that this discussion also means that 

trfAlogS) = { tr ( Alo § + - B ) when range(A) C range(S) ^ 
\ — oo otherwise 

As before the function A( A, B) is non-negative and equal zero iff both arguments agree 
(e.g. (NC00)). 

Theorem 7 Let the prior D(M) be any density matrix and data likelihood D(y\M) be any 

symmetric positive definite matrix of the same dimension. Then 

-logD(y)= inf A(W, D(M)) - tr(W log D(y\M)), 

W dens.mat. 

and W = D ( M )®^(y\ wi ') i s ^ e unique optimum solution. 
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Proof For any density matrix W s.t. range(W) C range(£)(M)) n range(.D(y|M)), we 
have 

- log D(y) = tr(W(log W - log D(M)) -tr( W(log D(y \M)) 

" v ' 

A(W,D(M)) 

- tr(W(log W - (log D(M) + log D(y|M))/D(y))). 

Since range (W) C range(D(M)) nrange( J D(y|M), tr(WlogD(M)) and tr(WlogD(y|M)) 
are both finite. Assuming that for any symmetric positive definite matrices W, A and B 

tr(W(log A+log B) = tr(W log( A0.B)), when range(VT) C range(A)nrange(S), (9.4) 

the above equality would become 

- log D(y) = A( W, D(M)) - tr( W log D(y|M)) - A( W, (D(M) D(y|M)) /U(y)). 

Since the l.h.s. is a constant, 

inf A(W,D(M))-tr(WlogD(y|M)) 
W dens. mat. 
range(W) C range(D(M)) n range(D(y|M)) 

sup -A(W,(D(M)Q D(y\M)) /D(y)). 

W dens. mat. 
range(W) C range(D(M)) n range(D(y|M)) 

The sup clearly has the unique solution D ( M )®D(v\ M ) anc [ j n f remains unchanged if the 
condition on the range of W is dropped. This gives us the statement of the theorem. 

We still need to show (9.4). Since range(VF) C range(A)nrange(B) ^ range(A0S), 

tr(Wlog(i4 0B)) ( = 3) tr(Wlog+(A0B)) 

( = 6) tr(WP An B(log + A + log+ B)P AnB ) 
= tr(PAnBWPAnB(log + A + log + B)) 
w 

(9 = 3) tr(W(logA + lo gj B)). 



We conclude with a discussion of the relationship between the conventional Bayes rule for 
probability vectors and the generalized Bayes rule for density matrices. Density matrices 
are determined by a probability vector of eigenvalues as well as an orthogonal eigensystem. 
An orthogonal system Wi turns the prior density matrix D(M) into the probability vector 
(tr(D(M) WiwJ)), which we call a pinching of D(M). Similarly the pinching of the data 
likelihood matrix D(y\M) is the vector (tr(D(y|M) WiwJ)) G [0, l] n . The idea is to express 
our Bayes rule for density matrices as the conventional Bayes rule for the pinched priors 
and likelihoods w.r.t. a certain eigensystem. That is, we want to be able to say that the 
generalized Bayes rule is the conventional Bayes rule for the "best" pinching. 
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The above outline is essentially true, but we need to pinch in the log domain. With 
Equality (9.3), Property OP11 can be extended to 

tr(A uu T ) = e u l °s Au f or any unit u and symmetric positive definite matrix A. 

(9.5) 

We call tr(A WiwJ) a remote pinching of A. Since its components satisfy tr(D(M) 
OP10 

WiW- ) < ti(D(M)wiW- ), the remote pinchings of D(M) must be normalized to form a 
probability vector. 

We can rewrite the argument of the optimization problem for the generalized Bayes rule 
based on the eigendecomposition WwW T of the density matrix W: 

A(W, D(M)) - tr(W log D(y\M)) 

= tr(cjW T (logW-logZ>(M))W) -tr(^W T (logD(y|M))W) 
= ^^(logwi - wj (log D(M)) Wi ) - ^u^wj (log D(y\M)) Wl 

i i 

( = 5) ^^(logwi - logtr(D(M) w iW J)) - ^ ^ logtr(D(y|M) w iW J) 

i i 

= A ((Wi) , (tr(£>(M) WiwJ)/Zy^) - ^tuaogtr(D(y|M) W ^ T ) - logZ w , 

i 

where the normalization Zyy = ^ ■ tr(.D(M) wjwj) does not depend on the eigenvalues. 
By Theorem 6, the above is minimized w.r.t. oj when cj = (Py^(Mi)Py^(y\Mi) j P\\>(y)) , 
where Py^(Mi) := ti(D{M)QWiwJ ) / Zyy is the normalized remote pinching of the prior and 
Pyy(y\Mi) := ti(D(y\M) WiivJ) is the remote pinching of the data likelihood matrix. 
With this optimum choice of u>, the minimization problem of the generalized Bayes rule 
simplifies to 

inf - log P w (y) - log Z w 
WW =/ 



WW 

OPlb 

WW 



OPW 

> 

WW 



inf -log(Vtr(D(M) WiivJ) tr(D(y|M) WiivJ))) 

w ' 

inf - log( V tr((D(M) D(y\M) WiwJ)) 

™ ' i 

inf - log( V tr((D(M) D(y\M)wiwJ)) 



— logtr(Z?(M) D(y\M)). 

The above inequality is tight iff W is an eigensystem of D(M) D(y\M). We conclude 
that the optimization problem for the generalized Bayes rule is optimized when W is an 
eigensystem of D(M) D(y\M) and the vector of eigenvalues u is conventional posterior 
derived from the normalized remote pinchings of the prior and the remote pinchings of the 
data likelihood. 
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9.2 Chaining of the Bayes Rule 

The conventional Bayes rule can be applied iteratively to a sequence of data and various 
cancellations occur. For the sake of simplicity we only consider two data points yi, y 2 : 

P(M , , P(M i \y 1 )P(y 2 \M l ,y 1 ) P{M l )P{y l \M i )P{y 2 \M u y 1 ) 

P(Ml|y2 ' yi) = = P(yMP(m) " 

The normalization can be rewritten as: 

P(V2\yi)P(yi) = (^nM^P(y2\M l , yi ))^2p(M t )P( yi \M l )) 

% using (9.1) 

= ^P(M i )P(2/i|M i )P( 2 / 2 |M i ,y 1 ) = P(y 2 , 2/1 ). (9.6) 

i 

Analogously, by essentially applying the generalized Bayes rule (9.2) two times we get: 
D(M|yi) D(y 2 \M, Vl ) D(M) D( yi \M) D(y 2 \M, yi ) 



D(M\y 2 , yi ) 



D(y 2 \yi) D(y 2 \y 1 )D(y 1 ) 



As in the diagonal case (9.6), the normalization can be rewritten into one term (by applying 
TP2 twice and then the generalized Bayes rule (9.2)): 



D(y 2 \ yi )D( yi ) = tr(D(M|yi)0D(y2|M,yi)) tr(U(M) D(y 



= tr( -°( M ) Q-P^il— / QD(y 2 \M, yi )) tr(D(M) D( yi \ 
V tr(£)(M)0£)(yi|* r ■ 1 ' 



= tr(D(M) D( yi \M) D(y 2 \M, Vl )) = D( yi ,y 2 ). 

Finally as in (8.1), we can upper bound the data probability D(yi, y 2 ) in terms of the 
product of the expected variances for the two trials: 

D(y 2 , yi ) = tr(U(M|yi) Q D{y 2 \M, yi )) tr(D(M) D( Vl 
< tr(D(M|yi)D(y2|M,yi)) tr(£>(M)D(yi 



9.3 Bounds 

Recall the following conventional bound for the negative log-likelihood of the data i.t.o. the 
negative log-likelihood of the MAP estimator: 

-logP(y) = -log^P(y|M l )P(M J ) 

(9.7) 

< min(- log P(y\Mi) - log P{Mi)). 

i 

We will give analogous bound for density matrices. For this we need the following inequality: 
For any unit vector m and symmetric positive definite matrix A: 

T OP10 T (9.5) -p 

— log Tn Am < — log tr(A Q mm ) = —m (logA)m. (9.8) 
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Using the fact that tr(A) > m T Am, we can now prove an analogous MAP bound for the 
generalized probabilities: 



-logD(y) = -logtr(D(y|M)0D( 

< min(-logm T (Z%|M) D(M)) ra) 

m 

(9.8) 

< min(-m T log(D(y|M) ® D{M)) m) 

m 

< min(-m T log£>(j/|M) m - m T logD(M) m). 

m 

The last inequality becomes Equality (9.4), when m G range(D(M)) n range(D(y|M)). 
Otherwise, it holds trivially because — m T log(ZJ(y|M) D(M)) m = +oo. 

Intuitively, there are two domains: the probability domain and the log probability do- 
main. The conventional bound (9.7) can also be written in the probability domain: 

P(y) >maxP(M i )P(y|M i ). 

i 

However for the generalized probability case, there does not seem to be a simple similar 
inequality in the probability domain. Throughout the paper we always notice that the 
matrix operations need to be done in the log domain. 

In the conventional case P(y) is also upper bounded by maxjP(y|Mj). For the gen- 
eralized case, the analogous formula is the following, where m and rrii are the eigenval- 
ues/vectors of D{M) and m any unit direction: 



D(y) = tr(D(y\M)QD( 
< tr(D(y|M)D(M)) 
= ^iHm] D(y\M)mi 



< max mj D 



< max m T D(y\M)m. 

m 

10. Summary of the Probability Calculus for Density Matrices 

In this section we give a summary of all the rules of our calculus. The definitions are 
indicated with := and at the end we summarize the justification for our choice of definitions. 
Table 1 shows connections between different objects and the formulas that relate them. 

10.1 Marginalization Rules for Joints of Sections 5 and 6 

MJ1. D(a) := tr(D(A)aa T ) = a T D(A)a. 

MJ2. D(A) := tr B (-D(A, B)). 

MJ3. D(a,b) :=tr(D(A,l)(a®6)(a®6) T ) = tr(D(A, B)(aa T bb 1 )). 

MJ4. D(A,b) :=tr B ( J D(A,B)(/ A 6b T )). 

MJ5. D(a, b) =tr(£>(A, b)aa T ). 
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Table 1: A series of charts summarizing the different relationships for joints and conditionals. Each edge 
references the formula stating the relationship. For symmetric cases only one formula is given and the 
corresponding edges in the chart will have the same label. 
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10.2 Conditional Probability Rules of Section 7 

CP1. D(A|B) := D(A,B) (I A D{ 

CP2 . D(A | 6) : _ D ' A ; b ' 

v 1 ' tr(D(A,b)) 
CP3. D(a|B) := £)(a,B) 0.0(B)- 1 
£>(a,&) 



CP4. D(o|6) := 



D(b) 



CP1 has the form: density matrix inverse of a normalization. Below we reexpress the 
other definitions in this unified form: 

CP'2. D(A\b) = tr B (Z>(A,B)(I A bb 1 )) tr B ((/ A D(M))(I A bb 1 ))- 1 . 
CP'3. D(a|B) =tr A (£>(A,B)(aa T 0/ B )) 0tr A ((/ A 0-D(B))(aa T 0I B )) _1 - 
CP'4. D(o|6) = tr(£>(A,B)(aa T bb T )) tr((J A Z>(B))(aa T fofe T )) _1 . 

10.3 Marginalization Rules for Conditionals of Section 7 

tr(D(A|B) (J A D(B))(ooT 6b T )) 



MCI. D(o|6) = 
MC2. D(A|6) = 



tr(Z>(B)&6 T ) 
tr B (D(A|B) (J A D(B))(J A M> T )) 



tr(£>(B)bb T ) 

MC3. D(a|B) = tr A (D(A|B) (J A £>(B))(aa T J B )) £>( 
MC4. D(a\b) = tr(D(A\b)aa T ). 

tr((D(o|B) ©£>(!)) 66 T ) 



MC5. D(a\b) = 



tr(D(B)bfo 1 



All the rules here except for MC4 require additional information for marginalization, which 
was not necessary in the conventional case. See discussion of marginalization of conditionals 
in Section 7. 

10.4 Theorems of Total Probability of Section 8 

TP1. D(a) = J2i D(a\bi)D(bi) for any orthogonal system bi of space B. 

TP2. D(A) = Yli D(A, bi) for any orthogonal system fej of space B. 
TP3. D(A) = tr B (D(A|B) (J A D(M))). 
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10.5 Bayes Rules of Section 9 

BR1. D(B|A) = (/ A ®D(B))0D(A|B)0(£)(A) Ib) -1 , where D(A) = tr B ((I A 0.D(B))0 
D(A|B)). 

BR2. D(a|B) = D(o)D(B|o) -D(B)^ 1 , where D(B) = tr A (-D(B|A) (-D(A) I B )). 
BR3. D(B\a) = D ( M )® D ( a \ M ) ^ where D ^ = tr (£>( B ) D(o|B)). 

BR4. D(6|o) = ^(°|k)-P(b) where D(a) = D(o|6j)X)(6i) and the summation is over 
D[a) ^-^ 

i 

any orthogonal system 6j. 

10.6 Summary of Justifications for the Definitions 

Note that only the rules MJ1-4 and CP rules are definitions. Everything else in our calculus 
can be derived from these. MJ1 is justified by Gleason's Theorem as discussed in Section 
3. Gleason's Theorem also justifies MJ3, where the Kronecker product provides the natural 
way to specify a joint unit (See discussion in Section 5). MJ2 is standard in quantum 
physics and tra(-D(A,B)) was shown to be a density matrix in Lemma 3. The rule is 
also compatible with the conventional case as well as with the natural generalization of 
independence discussed in Section 6. MJ4 is the natural definition of D(A, b) that satisfies 
MJ5 and is compatible with the conventional case. 

We will outline how CP2 can be motivated as a quantum relative entropy projection. For 
positive definite matrices A and B, we extend the definition of quantum relative entropy 
as follows: A(A,B) = tr(A(logA - logB) + B — A). Note that this "unnormalized" 
relative entropy, coincides with the standard one when A and B have trace one. Now CP2 
is motivated as 

D(A|6) = arginf A (W, D(A, b)) . 

W dens. mat. 

CP3 is motivated analogous to the generalized Bayes rule (See Section 9): 

D(a, B) = arg inf A(W, D(B)) - tr(W log D{a\M)). 
w 

CP1 can be motivated in a similar fashion, but now the variable is over the joint space 
(A,B): 

D(A,B)= arginf A(W, I A £>(B)) - tr(W logD(A|B)). 

W dens. mat. 

CP1 also was previously used in (CA99) to allow a suitable definition of conditional quantum 
entropy. Finally, the last rule CP4 was chosen in analogy to the conventional case. It also 
has an interpretation as two successive quantum measurements (see Appendix A). 

Historically, we first justified the generalized Bayes rule BR3 based on the minimum 
relative entropy principle (See (War05) and Section 9). After that we chose definitions 
CP1-CP4 to be compatible with this generalized Bayes rule. 
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11. Conclusions 

Density matrices are central to quantum physics. We utilize many mathematical techniques 
from that field to develop a Bayesian probability calculus for density matrices. Intuitively, 
the new calculus will be useful when the data likelihood D(y\M) has non-zero off-diagonal 
elements, i.e. information about which components are correlated or anti-correlated. The 
main new operation AQ B first takes logs of the matrices adds the logs and finally expo- 
nentiates. Any straightforward implementation of the operation requires the eigendecom- 
positions of the matrices, which are expensive to obtain. Throughout our work we notice 
that the log domain seems to be more important in the matrix case. 

Interestingly enough the operation has also been employed in computer graphics for 
combining affine transformation (Ale02). Also the simulation of quantum computations 
based on the Lie Trotter Formula ((NCOO), Chapter 4.7) can be interpreted as applying the 
operation to unitary matrices and not to symmetric positive definite matrices as we do 
in this paper. 

The main update in quantum physic is a unitary evolution of the current density matrix 
A, i.e. A := UAU T , where U is unitary. For example, the main differential equation for 
density matrices in quantum physics is the following version of the Schrodinger Equation 
(Fey72): 

dD(M\t) 
dt 

The solution has the form 



i (H D(M\t) - D(M\t) H), where H is skew Hermitian. 



D(M\t) = exp(-i t H) D(M|0) exp(i t H), 

where D(M|0) is the initial density matrix. Since itH is skew Hermitian, both exponentials 
are unitary. Thus the above update represents a unitary transformation of the initial density 
matrix D(M|0). Such transformations leave the eigenvalues unchanged and only affect the 
eigensystem. In contrast our generalized Bayes rule updates both the eigenvalues and 
eigenvectors, and the conventional Bayes rule can be seen as only updating the eigenvalues 
while keeping the eigenvectors fixed. Therefore the Bayes rules are decidedly not unitary 
updates. 

For the sake of completeness we now express the Bayes rules also as solutions to differ- 
ential equations. In the conventional case, the differential equations are (1 < i < n): 



dt 

The solution is 



9l0gP R \ m) = log P(y\M t ) - ^P(M J |t)logP(y|M, 

3 

P( M\t\- p (Mi\o)P(ymy 

{ *' ' £.-P(M j |0)P( 1 ,|M^- 



If we take the value P(Mj|0) as the prior P(Mj) then the expression for P(Mj|l) becomes 
the conventional Bayes rule (9.1). There is a similar differential equation for the generalized 
Bayes rule (For the sake of simplicity we assume that the prior D(M) and data likelihood 
matrix D(y\M) are strictly positive definite): 

dlogD{ 



at 



logD(y\M) -tr(D(M|t)logD(y|M)). 
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The solution has the form 



D(U\t) = 



exp (log D(M|0) + t log D(y\M)) 



tr (exp(log£>(M|0) + t log D(y\M))) 



(4_i) r>(M|O)0£>(j/|M)' 



tr (U(M|0) D(y\M.y) ' 



If we set Z>(M|0) to the prior D(M), then the expression for D(M|1) becomes the generalized 
Bayes rule (9.2). Notice again that the differential equations emphasize the log domain and 
that the operation appears in the solution. 

At this point we have no convincing application for the new probability calculus. How- 
ever, a similar methodology was used to derive and prove bounds for parameter updates 
of density matrices that led to a version of Boosting (TRW05) where the distribution over 
the examples is replaced by a density matrix, an online variance minimization algorithm 
where the parameter space is the unit ball ( WK06b) , and an on-line algorithm for Principal 
Component Analysis (WK06a). 

In this paper our parameters expressing the uncertainty are symmetric positive definite 
matrices. However using essentially the EG± transformation (KW97), it has been shown 
recently that inference can be done with arbitrarily shaped matrices (War07). This leaves 
the strong possibility that the calculus developed here will generalize to arbitrary shaped 
matrices as well. In that case the elementary events are "asymmetric dyads" uv T and the 
underlying decomposition is the SVD decomposition. 

The new calculus seems to be rich enough to bring out some of the interesting phenomena 
of quantum physics, such as superposition and entanglement. Maybe the new calculus can 
be used to maintain "uncertainty" in quantum computation. 

On a more technical note, we conjecture that for all non-decoupled joints D(A, B) there 
is a one-to-one mapping to the conditionals D(A|B), and the EM-like algorithm given in 
Section 7 converges to D(B), s.t. D(A,B) = D(A|B) (J A D(B)). 

Finally, we will reason in a simple case that generalized probability space is more "con- 
nected" and a clever algorithm might be able to exploit this. Assume zero is encoded as 
the distribution (1,0) and one as the distribution (0, 1). Moving from the zero distribution 
to the one distributions can be done by lowering the probability of the first component and 
increasing the probability of the second. As density matrices, zero and one would be (Jo) 
and ( o i ) j respectively. Note that the eigensystem for both matrices is the identity matrix 
and there is now a second way to go from zero to one that keeps the eigenvalues/probabilities 
fixed but swaps the eigenvectors: 
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APPENDIX 

Appendix A. Quantum-Mechanical Interpretation of Conditional 
Probability D(a\b) 

We will now show how to interpret the conditional probability D(a\b) in terms of two 
quantum measurements. The two measurements will be performed one after another on 
the joint density D(A,M) and D(a\b) will be a probability of outcome 1 for the second 
measurement given the first measurement had outcome 1. First, we measure D(A,B) with 
event J A <8> bb T . Assume that we get outcome 1. Using the generalization of collapse rule 
for events (see e.g. (NCOO)), the successor density matrix can be computed as follows: 

d(a r - (/a ® bbT )^ A ' B ^ ® bbT ) 



; tr((/ A ®66 T )£)(A,B)(/ A 6b T )) 

The second measurement consists of measuring the updated joint with event aa T 
Now the probability for getting outcome 1 is computed as: 

tr(f}(A RVnn T 6, T \\ tr((J A (g, bb T )D(A, B)(I A (g, bb T )(aa T (g, 7 B )) 
tr(D(A,B)(oa ® I B )) tr((/ A ® 6bT)ZJ( A> B)(/ A ® W,t }) 

^P2+cycic tr(Z>(A,B) (aa T ®bb T )) D(a,b) 



tr(D(A,B)(J A <g> 66 T )) tr(D(A,B)(I A &b T )) ' 
The denominator can be simplified using partial trace properties: 

tr(D(A,B)(I A <g> feb T )) P = 2 tr(tr A (£>(A,B)(I A ® && T ))) P = 3 tr(tr A (£>(A,B)) bb T ) = D(b). 

V v ' 

D(B) 

Therefore the probability of outcome 1 on the second measurement (given the first outcome 
was 1) is: 

tr(£(A,B)(aa T ® /»)) = C l A D{a\b). 

D{b) 
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